We want to know how well the model will perform on data it hasn’t seen before. We use \(k\)-fold cross-validation to do this.
In \(k\)-fold cross-validation, we divide the data randomly into \(k\) approximately equal groups or folds. The schematic here shows 5-fold cross-validation.
The model is fit on \(k-1\) of the folds and the remaining fold is used to evaluate the model. Let’s look at the first row in the schematic. Here the model is fit on the data that are in folds 2, 3, 4, and 5. The model is evaluated on the data in fold 1.
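The random split into approximately equal folds can be sketched in plain Python (the function name `make_folds` and the use of a seeded shuffle are illustrative assumptions, not from the source):

```python
import random

def make_folds(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k approximately equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Fold f takes every k-th shuffled index, so fold sizes differ by at most 1.
    return [idx[f::k] for f in range(k)]

folds = make_folds(10, 5)
# Each fold in turn serves as the evaluation set; the other k-1 are used for fitting.
for f in range(5):
    held_out = folds[f]
    training = [i for g in range(5) if g != f for i in folds[g]]
```

With 10 observations and 5 folds, each fold holds 2 observations and every observation appears in exactly one fold.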
Root mean squared error (RMSE) is a common performance metric for models with a quantitative response. It is computed by taking the difference between the predicted and actual response for each observation, averaging the squared differences over all observations, and taking the square root of that average.
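That definition translates directly into a few lines of Python (the helper name `rmse` and the toy numbers are illustrative):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    return math.sqrt(
        sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

# Differences are 0, 0, and 2, so RMSE = sqrt((0 + 0 + 4) / 3)
rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```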
So, again looking at the first row in the schematic, the model is fit to folds 2, 3, 4, and 5, and we would use that model to compute the RMSE for fold 1. In the second row, the model is fit to the data in folds 1, 3, 4, and 5, and that model is used to compute the RMSE for the data in the second fold.
After this is done for all 5 folds, we take the average RMSE to obtain the overall performance. This overall error is sometimes called the CV error. Averaging the performance over \(k\) folds gives a better estimate of the true error than a single hold-out set, and the spread of the per-fold scores also lets us estimate the variability of that estimate.